Solving Kinetic Equations on GPUs I: Model 

Kinetic Equations 



A. Frezzotti, G. P. Ghiroldi, L. Gibelli * 

Politecnico di Milano, Dipartimento di Matematica, Piazza Leonardo da Vinci 32, 

20133 Milano, Italy 



Abstract 

We present an algorithm specifically tailored for solving kinetic equations onto 
GPUs. The efficiency of the algorithm is demonstrated by solving the one-dimensional 
shock wave structure problem and a two-dimensional low Mach number driven cav- 
ity flow. Computational results show that it is possible to cut down the computing 
time of the sequential codes of two order of magnitudes. The algorithm can easily 
be extended to three-dimensional flows and more general collision models. 
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1 Introduction 



A recent trend emerging in computational physics stems from the availabil- 
ity of low cost general purpose graphics processing units (GPUs). GPUs have 
been used to accelerate CPU critical applications such as simulations of hyper- 
sonic flows [1], magnetized plasma [2] and molecular dynamics [3]. However, 
no applications to kinetic theory of gases seem to have been considered yet. 
Kinetic theory of gases deals with non-equilibrium gas flows which are met 
in several different physical situations ranging from the re-entry of spacecraft 
in upper planetary atmospheres to fluid-structure interaction in small-scale 
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devices [4,5]. The dynamics of dilute (or rarefied) gas flows is governed by the 
Boltzmann equation [6] which takes the form 

^ + ^ o V./ + -V. o (Ff) = C{f, /) (1) 
ot m 

C{fJ)^ I [f{x,v*\t)f{x,vl\t)- 

-f{x, V\t)f{x, Vi\t)] a{\\Vr\\,k O Vr)\\Vr\\dVi(fk (2) 

when written for a gas composed by a single monatomic species whose atoms 
have mass m. In Eqs. (1,2), f{x,v\t) denotes the distribution function of 
atomic velocities v at spatial location x and time t, F{x,v\t) is an assigned 
external force field, whereas C(/, /) gives the collisional rate of change of / 
at the phase space point {x,v) at time t. As is clear from Eq. (2), C{f,f) is 
a non-linear functional of /, whose precise structure depends on the assumed 
atomic interaction forces through the differential cross section (j(||i;r||,feo'yr)- 
The dynamics of binary encounters determines o" as a function of the modulus 
\\Vr\\ of the relative velocity Vr = Vi — v of two colliding atoms and of the 
orientation of the unit impact vector k with respect to Vr [7]. The collisional 
dynamics also determines the pre- collisional velocities v* and vl which are 
changed into v and Vi by a binary collision. For illustration purposes, it is 
worth mentioning that the collision integral C{f, f) simplifies to 

C{fJ)^Yl [f(^,^*\t)f(x,vl\t) - f(x,v\t)f(x,v^\t)]\koVr\dv,d'k (3) 

for a dilute gas of hard spheres of diameter d. In this case v* and vl are 
obtained from v, Vi and k by the simple relationships 



V* = V + (Vr o k)k 



vl = Vi — {Vr o k)k 



(4) 
(5) 



Obtaining numerical solutions of the Boltzmann equation for realistic fiow 
conditions is a challenging task because the unknown function depends, in 
principle, on seven variables. Moreover, the computation of C(/, /) requires the 
approximate evaluation of a fivefold integral. Numerical methods for rarefied 
gas dynamics studies can be roughly divided into three groups: 

(a) Particle methods 

(b) Semi-regular methods 

(c) Regular methods 

Methods in group (a) originate from the Direct Simulation Monte Carlo (DSMC) 
scheme proposed by G.A. Bird [8]. They are by far the most popular and 
widely used simulation methods in rarefied gas dynamics. The distribution 
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function is represented by a number of mathematical particles which move in 
the computational domain and collide according to stochastic rules derived 
from Eqs. (1,2). Macroscopic flow properties are usually obtained by time av- 
eraging particle properties. If the averaging time is long enough, then accurate 
flow simulations can be obtained by a relatively small number of particles. The 
method can be easily extended to deal with mixtures of chemically reacting 
polyatomic species [8] and to dense fluids [9]. Although DSMC (in its tradi- 
tional implementation) is to be recommended in simulating most of rarefied 
gas flows, it is not well suited to the simulation of low Mach number or un- 
steady flows. Attempts have been made to extend DSMC in order to improve 
its capability to capture the small deviations from the equilibrium condition 
met in low Mach number flows [10,11]. However, in simulating high frequency 
unsteady flows, typical of microfluidics application to MEMS [4], the possi- 
bility of time averaging is lost or reduced. Acceptable accuracy can then be 
achieved by increasing the number of simulation particles or superposing sev- 
eral flow snapshots obtained from statistically independent simulations of the 
same flow; in both cases the computing effort is considerably increased. 
Methods in groups (b) and (c) adopt similar strategies in discretizing the 
distribution function on a regular grid in the phase space and in using fi- 
nite difference schemes to approximate the streaming term on the l.h.s of 
Eq. (1). However, they differ in the way the collision integral is evaluated. 
In semi-regular methods C{f, /) is computed by Monte Carlo or quasi Monte 
Carlo quadrature methods [12,13] whereas deterministic integration schemes 
are used in regular methods [14]. Whatever method is chosen to compute the 
collision term, the adoption of a grid in the phase space considerably limits the 
applicability of methods (b) and (c) to problems where particular symmetries 
reduce the number of spatial and velocity variables. As a matter of fact, a 
spatially three-dimensional problem would require a memory demanding six- 
dimensional phase space grid. Extensions to polyatomic gases are possible [15] 
but the necessity to store additional variables associated with internal degrees 
of freedom further limits the applications to multi-dimensional flows. In spite 
of the drawbacks listed above, the direct solution of the Boltzmann equation 
by semi-regular or regular methods is a valid alternative to particle schemes 
in studying unsteady or low speed flows. Actually, when the deviation from 
equilibrium is small a limited number of grid points in the velocity space is 
sufficient to provide accurate and noise free approximations of /, therefore 
simulations of multi-dimensional flows are feasible on modern personal com- 
puters in a wide range of Knudsen numbers [16]. 

An important feature of kinetic equations for dilute gases is the locality of the 
collision term; the coUisional rate of change C(/, /) at the spatial location x is 
completely determined by f{x,v\t). Hence, the time consuming evaluation of 
the collision integral can be concurrently executed at each spatial grid point 
on parallel computers. As shown below, the numerical algorithm associated 
with regular or semi-regular methods is ideally suited for the parallel archi- 
tecture provided by commercially available CPUs. The aim of the paper is 
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to describe an efficient algorithm specifically tailored for solving kinetic equa- 
tions onto GPUs using CUDA^^ programming model [17]. The efficiency of 
the algorithm is assessed by solving the classical one-dimensional shock wave 
structure and a low speed two-dimensional driven cavity flow. It is shown that 
it is possible to cut the computing time of the sequential codes of two order 
of magnitudes by a proper reformulation of the algorithm to be executed on 
a GPU. 

In order to make the algorithm development easier, the computations pre- 
sented here have been performed by replacing C(/, /) with its simpler BGKW 
approximation [6]. This choice eliminates the intricacies connected with the 
numerical evaluation of the Boltzmann collision integral and allows easier iden- 
tification of bottlenecks and optimization strategies. As will be shown in a 
separate paper, the full Boltzmann equation can be solved within the same 
general algorithmic framework by adopting a Monte Carlo quadrature method. 
This paper is organized as follows. Section II is devoted to a concise descrip- 
tion of the mathematical model and the adopted numerical method. In Section 
III the key aspects of the GPU hardware architecture and CUDA^^ program- 
ming language are briefiy described. Sections IV and V are devoted to the 
description of the test problems and the discussion of the results. Concluding 
remarks are presented in Section VI. 



2 Theoretical and numerical background 



Both from the theoretical and computational point of view, it is often conve- 
nient to replace the full Boltzmann equation with a model equation having a 
simplified collision term. In the kinetic model proposed by Bhatnagar Gross 
and Krook [18] and independently by Welander [19], C{f, /) is replaced by the 
expression z/ ($ — /). Accordingly, Eq. (1) is turned into the following kinetic 
equation: 

^ + voV^f + ^ V, o (F/) = (6) 

In Eq. (6) z/ is the collision frequency, whereas $ is the local equilibrium 
Maxwellian distribution function given by the expression 

^x,v t) = ^ ' ^ ^,^ exp<^- ^ '/^ } 7 

^ ' ^ [27rRT{x\t)f^ I 2RT{x\t) f ^ ' 

If V does not depend on the velocity v. then conservation of mass, momentum 
and energy requires that n, V and T in Eq. (7) coincide with the local values 
of density, bulk velocity and temperature obtained from / by the relationships 
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being R the specific gas constant. The above expressions show that Eq. (6) 
is a strongly non-hnear integro-differential equation, in spite of the linear ap- 
pearance of its r.h.s.. 

As is well known, the BGKW model predicts an incorrect value of the Prandtl 
number in the hydrodynamic limit [6]. Hence, u can be adjusted to obtain 
either the correct viscosity or heat conductivity, but not both. If viscosity is 
selected, then u is given by the following expression: 

nRT 

being //(T) the gas viscosity. 



2. 1 Outline of the numerical method 



In view of the exploratory nature of the present work, Eq. (6) has been solved 
by a simple numerical method which will be illustrated on a spatially one- 
dimensional problem. The extension to two or three-dimensional geometries 
is straightforward. 

In absence of external forces and in one-dimensional slab geometry Eq. (6) 
takes the form: 

I + = (10) 

where x is the single spatial coordinate and v^. the x-componcnt of the velocity 
vector V = {vx,Vy,Vz)- The spatial domain is a finite interval of the real axis, 
divided into cells of equal size Ax. The infinite three-dimensional velocity 
space is replaced by a rectangular box divided into = Ny^ x Ny^ x A^^^ cells 
of equal volume AV, A^^,^ being the number of velocity nodes associated with 
the velocity component v^- The size and position of the "velocity box" in the 
velocity space have to be properly chosen, in order to contain the significant 
part of / at any spatial position. The distribution function is assumed to be 
constant within each cell of the phase space. Hence, / is represented by the 
array fij{t) = f{x{t),Vx{j.^),Vy{jy),v^{j;,)\t), being x{i),v^^{j^^),Vy{jy),v^{j^) 
the values of the spatial coordinate and velocity components in the center of 
the phase space cell (i, j) and j = {jx,jy,jz)- 

The algorithm that advances fij{t) to fij{t + At) is constructed by time- 
splitting the evolution operator into a free streaming step, in which the r.h.s. 
of Eq. (6) is neglected, and a purely collisional step, in which spatial motion is 
frozen and only the effect of the r.h.s. are taken into account. More precisely, 
the distribution function fl^j = fijitn) at time level t„ is advanced to its value 
/"^^ = fij{tn+i) at time level tn+i = tn + At by computing an intermediate 



value /f^^ from the free streaming equation 
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Eq. (11) is solved by a simple first order upwind scheme 



fn+l 
■'i,3 



;i2) 



After completing the free streaming step, macroscopic variables n,, Vi and Tj 
are computed at each spatial grid point and /^j^^ is finally obtained by solving 
the homogeneous relaxation equation 

f = ^(^ - /) (13) 

Since n, V and T arc conserved during homogeneous relaxation, Eq. (13) can 
be exactly solved to obtain 

f^f' = [1 - eM-^im + eM-^i^t)~f?,r (14) 

in each cell (i,j) of the phase space. The time step At has been set equal 
to a fraction of 1/I7, being V a constant such that inequality Vi < V holds 
at each spatial cell. Such limitation on At ensures good accuracy but could 
lead to violation of the stability condition of the upwind scheme used in the 
free streaming sub- step. To overcame this difficulty, we note that the exact 
solution of the streaming term is 

/(x,v,i + Ai) =/(x-^;,Ai,v,i) (15) 

Thus, for each molecular velocity, t'x(jx)) the value of the distribution func- 
tion in the cell (i, j) of the phase space can be obtained by first translating 
the distribution function by a number of cells equal to the integer part of the 
Courant number C — Vx{jx)^t/ , [C], and then applying expressions (12) 
for the residual time step advancement. It should be observed that the den- 
sity, bulk velocity and temperature obtained from the discretized Maxwellian 
distribution function $jj are not exactly equal to rzj, and Tj. To ensure 
exact conservation of mass momentum and energy, the discretized j should 
be computed from Eq. (7) by using effective values n^, Vi and Tj which are 
obtained by requiring that the moments of the discretized Maxwellian coin- 
cide with Tij, Vi and Tj [12]. The adoption of the correction method is not 
always necessary, since mass, momentum and energy errors are very small, 
even for coarse velocity space grids. Numerical tests have shown that calculat- 
ing effective local values fij, and Tj to force exact conservation of coUisional 
invariants did not affect appreciably the solutions of the problems described 
below. 

As is clear, both the free streaming and the relaxation sub-steps can be easily 
parallelized, each of them consisting of a number of independent threads. 
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3 GPU and CUDA^^ overview 



NVIDIA® GPU is built around a fully programmable processors array orga- 
nized into a number of multiprocessors with a SIMD-like architecture [17], 

i.e. at any given clock cycle, each core of the multiprocessor executes the 
same instruction but operates on different data. CUDA^^^ is the high level 
programming language specifically created for developing applications on this 
platform. 

A CUDA'^'^ program is organized into a serial program which runs on the 
host CPU and one or more kernels which define the computation to be per- 
formed in parallel by a massive number of threads. Threads are organized into 
a three-level hierarchy. At the highest level, all threads form a grid; they all 
execute the same kernel function. Each grid consists of many different blocks 
which contain the same number of threads. A single multiprocessor can man- 
age a number of blocks concurrently up to the resource limits. Blocks are 
independent, meaning that a kernel must execute correctly no matter the or- 
der in which blocks are run. A multiprocessor executes a group of threads 
beloging to the active block, called warp. All threads of a warp execute the 
same instruction but operate on different data. If a kernel contains a branch 
and threads of the same warp follow different paths, then the different paths 
are executed sequentially (warp divergence) and the total run time is the sum 
of all the branches. Divergence and reconvergence are managed in hardware 
but may have a serious impact on performance. When the instruction has 
been executed, the multiprocessor moves to another warp. In this manner the 
execution of threads is interleaved rather than simultaneous. 
Each multiprocessor has a number of registers which are dynamically parti- 
tioned among the threads running on it. Registers are memory spaces that are 
readable and writable only by the thread to which they are assigned. Threads 
of a single block are allowed to syncronize with each other and are available 
to share data through a high-speed shared memory. Threads from different 
blocks in the same grid may coordinate only via operations in a slower global 
memory space which is readable and writeable by all threads in a kernel as well 
as by the host. Shared memory can be accessed by threads within a block as 
quickly as accessing registers. On the contrary, I/O operations involving global 
memory are particularly expensive, unless access is coalesced [17]. Because of 
the interleaved warp execution, memory access latency are partially hidden, 
i.e., threads which have read their data can be performing computations while 
other warps running on the same multiprocessor are waiting for their data to 
come in from global memory. Note, however, that GPU global memory is still 
ten time faster than the main memory of recent CPUs. 

Code optimization is a delicate task. In general, applications which require 
many arithmetic operations between memory read/write, and which minimize 
the number of out-of-order memory access, tend to perform better. Number 
of blocks and number of threads per block have to be chosen carefully. There 
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should be at least as many blocks as multiprocessors in the device. Running 
only one block per multiprocessor can force the multiprocessor to idle during 
thread synchronization and device memory reads. By increasing the number 
of blocks, on the other hand, the amount of available shared memory for each 
block diminuishes. Allocating more threads per block is better for efficient 
time slicing, but the more threads per block, the fewer registers are available 
per thread. 



4 Shock wave 



4.1 Formulation of the problem 



The propagation of a planar shock wave is a classical application of kinetic 
equations which is a rather natural choice as a benchmark problem because 
of the considerable number of previous studies [20,21]. In the wave front ref- 
erence frame, the stationary flow field is assumed to be governed by the one- 
dimensional steady BGKW equation 



/) 



(16) 



X being the spatial coordinate which spans the direction normal to the (planar) 
wave front. It is further assumed that, far from the wave front, the distribution 
function f{x, v) satisfies the boundary conditions 



lim f{x, v) = ^"^{v) = 



n 



=F 



exp 



(17) 



where n^, and arc the upstream and downstream values of number 
density, velocity and temperature, respectively. The parameters of the equi- 
librium states specified by Eq. (17) are connected by the Rankine-Hugoniot 
relationships 



rv 
n~ 



4(M- 



(M-)2 + 3 



T+ _ [b{M-f - 1] [{M-f + 3] 
T- ~ 16(M-)2 



(18) 



In Eqs. (18) M denotes the upstream infinity Mach number defined as 

V- 



M- 



,1/2 



(19) 



being 7 = 5/3 the specific heat ratio of a monatomic gas. 

The numerical scheme described in section 2.1 has been adopted to obtain 

approximate solutions of Eq. (16) with boundary conditions (17) as long time 
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limit of solutions of Eq. (10) with identical boundary conditions and initial 
condition 



The computations reported have been carried out for both a weak, M~ = 1.5, 
and a medium, M~ — 3.0, shock wave. The collision frequency has been 
obtained from Eq. (9), assuming that the viscosity is given by the expression 



In Eq. (21), Tq is a reference temperature, fio is the value of the viscosity at 
the reference temperature. The temperature exponent has been set equal to 
0.74 to match the computational conditions of Ref. [20] whose results have 
been used to asses the accuracy of the calculations presented here. An adi- 
mensional form of the Eq. (10) has been adopted in actual computations by 
normalizing velocity v to y/2RT^, time t to r~ = l/iy^ and spatial coordi- 
nate X to the mean free path = \/2RT^t^ . The reference value for the 
collision frequency has been obtained by setting Tq = in Eq. (21). The 
infinite physical space has been replaced by the finite interval [— L/2,L/2] 
which has been divided into A^^, identical cells of width Ax = L/N^. The adi- 
mensional size L of the spatial domain has been set equal to 70, varying 
between 128 and 18432. The cell number has been increased well above 
the limit imposed by accuracy in order to investigate the GPU performances 
as a function of computational load. Similarly, the velocity space has been 
replaced by a parallelepiped in which each velocity component Va varies in a 
finite interval, divided into N^^ equal cells. The position of parallelepiped in 
the velocity space and the cell number vary with the chosen Mach number. In 
case of the weak shock wave, = 1.5, the same number of grid points has 
been used for the three normalized velocity components by setting N^^ — 16, 
with fj. e [—5,7] and Vy,Vz G [—6,6]. In case of the = 3.0 shock wave, 
the grid point setting has been changed to Ny^ = 30, with G [—10, 12] and 
Vy,Vz G [—11,11]. Finally, the normalized time step At has been set equal 
to 0.05. Before describing the algorithm implementation and describing the 
results, it is worth observing that, for one and two-dimensional problems, the 
dimensionality of the velocity space associated with kinetic model equations 
having the structure of Eq. (6) can be accordingly reduced to one and two, 
respectively [22] . The reduction has not been made in the present work to keep 
the general structure of a three-dimensional code and, as mentioned above, to 
investigate the hardware response to heavy computer storage demand. 




(20) 




(21) 
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4-2 CUD A™ implementation 

The code to numerically solve Eq. (16) is organized into a host program, 
which deals with all memory management and other setup tasks, and two 
kernels running on the GPU. One performs the streaming step and the other 
one performs the colhsion step and compute the macroscopic quantities as 
well. Alghorithms 1 and 2 list the pseudocodes of both kernels. For clarity of 
presentation, the pseudocode of the streaming step refers to the case of > 0. 
Because of their different impact on the code performance, we distinguish the 
slow global memory reads, and writes, from the fast reads, and 
writes, — >, from local registers and shared memory. 

As shown by Eq. (15), for each given cell of the velocity space, the streaming 
step involves the distribution function evaluated at different space locations. 
The key performance enhancing strategy is to allow threads to cooperate in 
the shared memory. The threads should thus be grouped into as many blocks 
as the cells in the velocity space with a number of threads per block equals to 
the number of cells in the physical space. In practical applications, however, 
the number of cells in the physical space is greater than the maximum allow- 
able number of threads per block. In order to fit into the device's resources, 
hence, the number of threads per block, A^^, is set to a lower value which 
is chosen to maximize the utilization of registers and shared memory usage. 
When a block become active, each thread loads one element of the distribu- 
tion function from global memory, stores it into shared memory (line 7) and 
then update its vahic according to Eqs. (12) (line 13). This procedure is then 
repeated sequentially N^/Nt times. To ensure non-overlapping access, threads 
are synchronized at the onset of both reading from and writing to the global 
memory (lines 12 and 15). In order the access to the global memory to be co- 
alesced, the discretized distribution function has been organized such that the 
value which refers to cells which are adjacent in the physical space are stored 
in contiguous memory locations. A random memory access would determine 
otherwise a performance bottleneck. Threads which update boundary points 
perform calculations which are slightly different to account for the incoming 
Maxwelhan flux from the boundary of the domain (line 5). This leads to a 
thread divergence which determines some code inefficiency. However, testing 
shows that the performance loss is small. 

According to Eq. (14), in order to perform the relaxation step in a cell of the 
phase space, no information from nearby cells is needed. Hence the concurrent 
computation of the collision operator may be possible mapping each thread to 
a single cell in the phase space. This choice naturally fits for GPUs. In order 
to evaluate the local Maxwellian, $, however, one must first calculate in each 
cell of the physical space the macroscopic quantities, Eqs. (18). In the attempt 
of reducing data transfers from and to the global memory, the computation 
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of the macroscopic quantities (lines 1-9) and the collision step (lines 11-13) 
are then performed in the same kernel, by having a thread associated to each 
cell of the physical space. Although this choice reduce the overall number of 
threads, it is not quite limiting since for realistic three-dimensional problems, 
one would probably refine the physical grid more than the velocity grid. 



4-3 Results and discussion 

In this section, we first validate the code by solving the plane shock structure 
problem and then we evaluate its performance by comparing the GPU and 
CPU execution times. We chose representative commercial products from both 
the CPU and GPU markets: Intel® Core^^ Duo Quad Q9300 running at 2.50 
GHz with 6 MB of L2 cache and with 4 GB of main memory, and an NVIDIA® 
GeForce GTX 260 with CUDA^^ version 2.0. The GTX 260 consists of 24 
streaming multiprocessors. Each multiprocessor has 8 streaming processors for 
a total of 192 units, clocked at 1.24 GHz. Each group of streaming processors 
shares one 16 kB of fast per-block shared memory while the GPU has 896 MB 
of device memory. 

Figures la and lb show the velocity and temperature profiles versus the x 
coordinate. Solid and dashed lines are the results from the numerical solution 
of Eq. (16) for M~ = 1.5 and M~ = 3, respectively. Solid circles and squares 
are the results presented in Ref. [20] for M" — 1.5 and M~ = 3, respectively. 
The agreement is good and provides a validation of the numerical code. 

The performance of the GPU implementation is compared against the single- 
threaded version running on the CPU by computing the speedup factor S = 
Tcpu/Tcpu, where Tcpu and Tqpu are the times used by the CPU and GPU to 
process at each time step one element of the discretized distribution function, 
respectively. We first examine the performance of each kernel and then analyze 
the overall speedup of the program. Times are measured after initial setup, 
e.g., after file 1/0, and do not include the time required to transfer data 
between the disjoint CPU and GPU memory spaces. 

Figure 2 shows the speedup of the streaming kernel, S^, versus the number of 
cells in the physical space. Solid line with circles and dashed line with squares 
are the results for a different number of cells in the velocity space, N^,^ = 16 
and N^^ — 30, respectively. In both cases, the speedup sharply increases and 
then level off at about 3000, where the GPU capabihty is fully exploited. 
For a greater number of cells in the physical space, the streaming step scales 
linearly with the elements of the discretized distribution function. The speedup 
decreases with the number of cells in the velocity space but, even in the worst 
case, it is still about 200. 
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Figure 3 shows the same as Fig. 2 but for the coUision kernel, Unhke the 
streaming kernel, the speedup of the collision kernel, S^, is greatly improved 
by increasing the number of cells in the velocity space. 

Figures 4a and 4b show the relative time, spent to perform the streaming 
kernel, T^, and the collision kernel, T^, versus the number of cells in the physical 
space, for M~ = 1.5 and M~ = 3, respectively. As expected, the collision is 
much more time consuming than the streaming kernel. Moreover, the relative 
time does not appear to depend strongly on the number of cells in the velocity 
space. 

Figure 5 shows the overall speedup, S^, for the two test cases. Notation is the 
same as Figs. 2 and 3. The overall speedup for each case is in between 
and S'c, which is not unexpected. In fact, St is the weighted average of the 
streaming and collision speedups with weight the relative time Tj spent by the 
GPU to execute each kernel, i.e., S't = T^S^ + T^S^. 



5 Driven cavity 



5.1 Formulation of the problem 



The driven cavity flow is a classical multidimensional benchmark problem 
since, in spite of its simple geometry, it contains most of the features which 
appear in more complicated problems described by kinetic equations. A gas 
is confined in a two-dimensional square cavity and the flow is driven by a 
uniform translation of the top with velocity U^^^x- The gas flow is supposed 
to be governed by the two-dimensional steady BGKW equation 



It is further assumed that all the walls are isothermal and that the particles 
which strike the walls are re-emitted according to the Maxwell's scattering 
kernel with complete accommodation 



f{x,v) = $w(v) 



(27ri?ro)V2 



exp 



2RTn 



{v-V^)on > (23) 



where cc is a point of the boundary, n the inward normal, the wall velocity 
and riw the wall density deflned as 



271 



{v—V^/v)on<0 



\{v - V^)on\f dv 



(24) 
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in order to impose zero net mass flux at any boundary point. 
The two-dimensional extension of the numerical scheme described in section 
2.1 has been adopted to obtain approximate solutions of Eq. (22) with bound- 
ary conditions (23) as long time limit of the unsteady problem. The adimen- 
sional form of the governing equation has been obtained as described in sec- 
tion 4. In Ref. [16], the cavity flow problem has been solved by assuming that 
V^w ^ \/2RTq and thus Eq. (22) has been linearized around the equilibrium 
state at rest. In order to reproduce these results, the dimensionless lid velocity 
is here set to 0.01. The gas is thus in a weakly non-equilibrium state and the 
nonlinear results approach the linearized ones. The square cavity, [0, 5] x [0, 5] , 
has been divided into = Ny — 160 cells with uniform width. Here 5 is the 
rarefaction parameter which is proportional to the inverse of the Knudsen 
number. The cavity flow problem has been solved over a wide range of the 
rarefaction parameter, 5 e [0.1 — 10]. The computational grid in the physical 
space has been chosen to achieve the convergence of the results in the whole 
range of rarefaction parameter considered. The number of velocity cells have 
been set N^^ = 20 with Vx,Vy,Vz G [—3,3]. Finally, the time step has been 
varied in the range 10~^ — 10~^ depending on the rarefaction parameter. 



5.2 Results and discussion 

Figures 6a and 6b show the profiles of the horizontal component of the velocity, 
Vx/Vyf, on the vertical plane crossing the center of a square cavity and the 
vertical component of the velocity, Vy/V^f^, on the horizontal plane crossing the 
center of the top vortex, respectively, for two different value of the rarefaction 
parameter, = 0.1, 10. Solid lines are the numerical results obtained by solving 
Eq. (22) with the parallel code, soUd circles are the results reported in Ref. 
[16]. The agreement is quite good. The near hnear profiles of the velocity in 
the central core of the cavity indicate the uniform vorticity region. In order to 
proceed with a more detailed comparison, we introduce two overall quantities, 
namely the mean dimensionless shear stress, D, along the moving plate and 
the dimensionless flow rate, G, of the main vortex. The former quantities is 
obtained by integrating the shear stress along the lid of the cavity, the latter 
by integrating the x-component of the velocity profile along the plane crossing 
the center of the cavity from the center of the top vortex up to the lid. Table 
1 compares the prediction of D and G obtained by solving Eq. (22) with the 
parallel code and the values reported in Ref. [16], for different values of the 
rarefaction parameter. The agreement is good, the greatest mismatch being 
for the drag coefficient at S = 10. However the discrepancy is easily removed 
by increasing the number of cells in the physical space. Figure 7 shows the 
time spent for processing one cell of the the phase space at each time step, 
expressed in nanoseconds, versus the number of cells used to discretize the 
physical space, — for the one-dimensional code and — Nj; + Ny for 
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the two-dimensional code. The sohd and dashed hnes are the results for the 
shock wave and driven cavity flow problems, respectively. The one-dimensional 
and two-dimensional codes fully utilize the GPU when the number of cells in 
the physical space is about 6000 and 9000, respectively. This difference is due 
to the fact that the one-dimensional code makes a better use of CPU's registers 
and shared memory. For a greater number of cells, the total execution time 
increases linearly and hence the time spent for processing one cell of the phase 
space at each time step is nearly constant. When the GPU is fully exploited, 
the two-dimensional code is slower than the one-dimensional code by a factor 
of about 2, which is not unexpected because of thread divergence determined 
by the greater number of boundary cells. Although the sequential version of 
the two-dimensional code has not been written, it can be safely inferred that 
the speedup of the two-dimensional code is about an half of the speedup of 
the one-dimensional code, on the basis of these results. 



6 Conclusions 



The aim of this paper is to describe the development of an algorithm to solve 
kinetic equations by exploiting the computing power of modern GPUs. Test 
gas flows have been studied by adopting the Bhatnagar-Gross-Krook-Welander 
(BGKW) kinetic model for the collision term in combination with a simple fi- 
nite difi^erence scheme. Numerical experiments with the one-dimensional shock 
wave structure problem and the two-dimensional driven cavity flow indicate 
that it is possible to cut down the computing time of the sequential codes up 
to two order of magnitudes. For instance, the solution of the two-dimensional 
unsteady driven cavity flow for 6 = 10 with A^^^ = 20, = Ny = 160 took 
about 7 minutes to execute 4100 time steps. It is worth to notice that if the 
specific two-dimensional nature of the problem is taken into account, then the 
dimensionality of the velocity space can be reduced by a standard projection 
procedure [22]. In this case, the computing time can be further reduced to 
about 40 seconds while keeping the same accuracy level. The algorithm de- 
scribed can easily be extended to three dimensions and to non-equilibrium 
fiows involving mixtures, and/or chemical reactions [23]. Extension to poly- 
atomic gas is possible as well provided that one adopts a proper representation 
of the distribution function to cope with the enlargement of the phase space 
due to internal degrees of freedom [15]. This paper describes the first stage 
in the development of algorithms for solving non-equilibrium gas fiows onto 
GPUs. The extension of this algorithm to semi-regular method of solution 
to the full Boltzmann equation is presently being investigated and it will be 
considered in a future paper. 
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Captions to Figures: 



Fig. 1: (a) mean velocity and (b) temperature profiles versus the x coordinate. 
Solid and dashed lines are the results obtained with the parallel code for 
M~ = 1.5 and M~ = 3, respectively. Solid cirlces and squares are the results 
presented in Ref. [20] for M~ — 1.5 and M~ = 3, respectively. 

Fig. 2: speedup of the streaming kernel, S^, versus the number of cells in the 
physical space, N^. Solid line with circles: Ny^ = 16; dashed line with squares: 
N,^ = 30. 

Fig. 3: speedup of the collision kernel, S^, versus the number of cells in the 
physical space, N^. Solid line with circles: A^^^ — 16; dashed line with squares: 
=30. 

Fig. 4: relative time spent on the streaming and collision kernel for (a) N^^ = 
16 and (b) N^^ — 30. Solid bar: streaming kernel; pattern bar: collision kernel. 

Fig. 5: overall speedup, S^^^, versus the number of cells in the physical space, 
Nx- Solid hne with circles: Ny^ = 16; dashed line with squares: Ny^ — 30. 

Fig. 6: profiles of (a) the horizontal component of the velocity on the vertical 
plane crossing the center of the cavity and (b) the vertical component of the 
velocity on the horizontal plane crossing the center of the top vortex. Solid 
lines: numerical solution obtained with the parallel code; solid circles: solution 
reported in Ref. [16]. 

Fig. 7: time spent for processing one cell of the phase space versus the number 
cells used to discretized the physical space. Nr. Solid line: one-dimensional 
code. Dashed line: two-dimensional code. Ny^ — 16. 

Table 1: drag coefficient, D, and reduced flow rate, G, versus the rarefaction 
parameter, 5. 
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Algorithm 1 GPU pseudo-code of the streaming step 
Require: [C] is the integer part of the Courant number 
Require: C is the fractional part of the Courant number 
Require: N^. is the number of ceUs in the physical domain 
Require: Nt is the number of threads in each block 
Require: t — . . .Nt — 1 is the index of the thread within the block 
r^N,-Nt- [C] 
w^N^-Nt 
for i = 1 to N^/Nt do 
if r + t < then 

else 

U{t + 1) ^ 

end if 

if i = then 

/sh(0) <= 
end if 

synct breads 

syncthreads 
r <— r — Nt 
w ^ w — Nt 
end for 



18 



Algorithm 2 GPU pseudocode of the collision step 
Require: i is global index of the thread inside the grid 
1: for all j do 

TT-rg ^rg "l~ /rg 
Vrg <— Vrg + Vj /rg 
Crg < Brg + |Vj| /rg 

end for 

rirg <— nrg AV 

Vrg Vrg/^rg 
^rg (Crg/jT-rg ~ l^^rgl )/3 

for all j do 

Jrg Jjj 

/rg ^ [1 - exp(-i/iAt)] + exp(-t/iAt)/rg 

Jrg ^ Jij 

end for 

nrg =^ Ui 

V ^ V- 

V rg r' V ^ 
J. rg — < J- 1 
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